Integrated circuit device

ABSTRACT

In an integrated circuit device that includes a first memory that is capable of inputting data into and/or outputting data from a second memory and a processing unit in which at least part of at least one data flow is changeable, the processing unit includes, in addition to a data processing section that processes data that is inputted from and/or outputted to the first memory, a first address outputting section that outputs a first address of data that is inputted and/or outputted between the first memory and the data processing section and a second address outputting section that outputs a second address of data that is inputted and/or outputted between the first memory and the second memory. By using part of the processing unit, where a data flow can be changed or reconfigured, for configuring a circuit that controls the memories, a cache memory system that is optimal for the processing executed by the integrated circuit device can be configured in the integrated circuit device.

TECHNICAL FIELD

[0001] The present invention relates to an integrated circuit device in which data flows can be reconfigured.

RELATED ART

[0002] When data and/or instructions (hereinafter referred to as “data” where there is no particular need to distinguish between “instructions” and “data”), which are stored in a memory, such as a RAM, a ROM, or a magnetic disc, are processed by a CPU or the like, a high speed memory called a “cache” or “cache memory” that has a comparatively small capacity is used and the access speed for the data is improved by utilizing the temporal locality and/or spatial locality of the data. Accordingly, in an integrated circuit device such as a VLSI, a system LSI, or a system ASIC where a processor or a processor core is incorporated, a cache system comprising a cache memory and an MMU (Memory Management Unit) for controlling the cache memory is also incorporated.

[0003] When a cache memory is used, an MMU and a TLB (Translation Look-Aside Buffer) are used, so that when the data corresponding to a virtual or logical address outputted from the CPU core is present in the cache memory, data is inputted and outputted between the cache memory and the CPU core. When the data is not present in the cache memory, the virtual address is converted into a physical address by the MMU and the TLB and an input/output is generated for an external memory, and the data in the cache memory is also updated. In this way, due to the cache control mechanism that comprises the MMU and the like, the cache memory is constructed as a device that appears to be transparent to the software that is executed by the CPU core. Accordingly, software can be developed so as to operate based on virtual addresses that do not depend on hardware, which makes it possible to reduce the time taken and cost incurred by software development and design. Also, the same software can be run on different hardware, which means that software resources can be used effectively.

[0004] When the data at the virtual address outputted from the CPU core is not present in the cache memory, which is to say, when a “hit” does not occur for the cache memory, an input/output process occurs for an external memory. When the hit rate of the cache memory is low, the cache memory becomes merely an overhead that detrimentally affects the execution time of programs. In order to improve the hit rate, studies are being performed into techniques such as separating the cache into an instruction cache and a data cache, constructing the cache with a hierarchical structure, or prefetching data mechanically and/or using software.

[0005] However, when applying a cache that is separated into an instruction cache and a data cache, if instructions and data are simultaneously present in one block, it becomes difficult to handle the instructions and the data. For example, rewriting instructions may obstruct software processing. Also, in software where instructions and data are not accessed equally, there is no improvement in efficiency by simply separating the cache. For example, when the accesses to data are sporadic, the usage efficiency of the data cache is low, so that there is the possibility of this becoming an overhead.

[0006] A hierarchical cache is effective when there are large differences in access time and storage capacity between the cache and the external memory. However, when the cache is constructed hierarchically, there is an inevitable rise in the number of accesses to the memory, so that there is always the possibility of overheads depending on conditions such as the structure of the software and the input/output media for the data being processed.

[0007] Even when prefetching is performed, penalties due to branch instructions or the like cannot be avoided. In some kinds of software, for example an arithmetical calculation program, in which many accesses are performed to array elements and the element to be accessed can be predicted in advance, the number of cache penalties can be reduced using prefetch instructions. However, CPU time is expended by the execution of such prefetch instructions, and this technique can be used effectively for only a limited range of software.

[0008] In this way, the above techniques are each capable of raising the hit rate of a cache memory in cases where conditions, such as the software executed by a CPU and the media on which data is stored, match the selected method of using the cache memory. However, since cache memory is hardware that is disposed in an intermediate position between the CPU and the external memory, when there are differences in the processing content of the software to be executed or in the hardware environment that stores the data to be processed by this software, this can cause problems such as the predicted cache efficiency not being obtained and, conversely, overheads being produced, which increases the execution time of the processor. For a processor that is dedicated to a certain application, it may be possible to provide an optimal cache memory system. However, for a processor that is designed to have a certain degree of general-purpose applicability, to ensure that the cache memory is worthwhile, it is necessary to provide a cache memory system that does not cause many overheads, even if the effectiveness of the cache memory system itself is not especially high. Accordingly, even if a cache memory system is provided, the improvement in performance is not especially large.

[0009] It is an object of the present invention to provide an integrated circuit device including a memory that can be used as a cache with the highest possible efficiency for the processing content of software executed by a processor and the hardware environment. It is a further object of the invention to provide an integrated circuit device including a control function that can use a memory as a cache with the highest possible efficiency. It is yet another object of the invention to provide an integrated circuit device that can execute a variety of software more efficiently.

DISCLOSURE OF THE INVENTION

[0010] In recent years, processing units in which the configuration of a data path or a data flow can be at least partially changed have been introduced. An FPGA (Field Programmable Gate Array) is an integrated circuit device in which logic elements or logic blocks of the same construction whose logic can be changed are laid out in an array, with it being possible to change the interconnects between these elements or blocks so as to change the configuration or construction of data paths. Research is also being performed into integrated circuit devices where it is possible to change the configuration of data paths using medium-scale basic functional units of the same construction that perform a variety of processes according to instruction sets. The applicant of the present invention has developed a processing unit including (i) a plurality of types of special-purpose processing elements, each type of special-purpose element having internal data paths suited to respectively different special-purpose processing, and (ii) sets of wires for connecting these special-purpose processing elements. In this invention, a circuit that controls a cache memory is configured using part of a processing unit of one of these kinds, where the data flows can be changed or reconfigured.

[0011] This is to say, an integrated circuit device according to the present invention includes a first memory for inputting data into and/or outputting data from a second memory and a processing unit in which at least one data flow is formed and at least part of at least one data flow is changeable, the processing unit including a data processing section that processes data that is inputted from and/or outputted to the first memory, a first address outputting section that outputs a first address of data that is inputted and/or outputted between the first memory and the data processing section, and a second address outputting section that outputs a second address of data that is inputted and/or outputted between the first memory and the second memory. By constructing a first address outputting section and a second address outputting section using part of the processing unit where the data flows can be changed, using the hardware configuration of the data processing section or the software executed in the data processing section, it is possible to change the data flow of the first address outputting section or the second address outputting section and to control the outputs of these sections. Accordingly, a cache system that is optimal for the processing executed by an integrated circuit device can be configured in the integrated circuit device. Alternatively, it is possible to configure a control circuit for a cache memory in the integrated circuit device so that a cache system can be optimally controlled for the processing executed by the integrated circuit device.

[0012] With the integrated circuit device of the present invention, the first memory that is used as the cache memory can be passively controlled by a second address in a second memory. The second address includes not only a physical address of data in the second memory but also a logical address or virtual address that can be converted into the physical address. Through such control, it becomes possible to make the first memory transparent to the second memory and/or the data processing section. In addition, according to data or a signal from the data processing section and/or the first address outputting section, the second address outputting section can actively control inputs and outputs of data independently of both the data processing section and the first address outputting section. It is also possible to control input/output operations between the first memory and second memory in parallel with the operations of the data processing section and the first address outputting section. Accordingly, it is possible to configure a cache system where the accessed location of data used by the data processing section and first address outputting section is determined by the second address outputting section, so that it is possible to construct not simply a conventional cache that is transparent for a CPU but a cache that controls the processing in the processing unit.

[0013] This is to say, conventional cache architecture is constructed so as to provide a uniform, transparent interface that can improve the average execution speed for software that operates on a processing structure of a standardized hardware construction, such as a CPU core or a DSP core. On the other hand, in the integrated circuit device of this invention, a data processing section that acts as a core is provided by using an architecture such as an FPGA in which the construction of a data path itself can be changed, and in accordance with this, the cache construction can be dynamically changed to an optimal construction for the configuration in the data processing section and the software executed by the configuration of the data processing section. Accordingly, uniformity or transparency is not always needed, and an interface or service that is completely different to a conventional cache can be provided for a data processing section that is the core or execution unit.

[0014] In this way, with the integrated circuit device of the present invention, the first memory can be used with the highest possible efficiency as a cache in accordance with the hardware environment and the processing content of the software executed by the processing unit. A cache system that can produce a higher hit rate can be constructed when a variety of software is executed, so that it is possible to provide an integrated circuit device where input/outputs for a cache memory do not cause overheads when a variety of software is executed.

[0015] As one example, when the address in the second memory of data to be executed by the data processing section is known, it is possible for the second address outputting section to independently prefetch data using the remaining amount of space in the first memory. Accordingly, data can be prefetched into the first memory that is used as a cache by hardware or by software that controls the second address outputting section without consuming processing time of the data processing section. In this example, an address in the first memory, which includes not only a physical address in the first memory but also a virtual address or logical address that can be converted into the physical address in the first memory, is outputted from the first address outputting section as the first address, and an address in the second memory, which includes not only a physical address in the second memory but also a virtual address or logical address that can be converted into the physical address, is outputted from the second address outputting section as the second address. In the data processing section, hardware or software is configured so that processing advances using addresses in the first memory that acts as a cache memory.
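The prefetching described in paragraph [0015] can be illustrated with a minimal software model. This is only a sketch under stated assumptions, not the circuit of the invention: the class name `PrefetchingCache`, its methods, and the FIFO organization of the first memory are hypothetical and chosen purely for illustration.

```python
from collections import deque


class PrefetchingCache:
    """Toy model: a first memory filled by a second address outputting section.

    All names here are hypothetical; the FIFO organization is an assumption.
    """

    def __init__(self, second_memory, capacity):
        self.second_memory = second_memory  # models the second (e.g. external) memory
        self.capacity = capacity            # number of words the first memory can hold
        self.first_memory = deque()         # models the first memory used as a cache

    def free_space(self):
        # Remaining amount of space in the first memory.
        return self.capacity - len(self.first_memory)

    def prefetch(self, second_addresses):
        """Copy words from the second memory while space remains.

        `second_addresses` is an iterator of second addresses, so a later
        call resumes where the previous call stopped.
        """
        fetched = 0
        while self.free_space() > 0:
            address = next(second_addresses, None)
            if address is None:  # no more addresses to prefetch
                break
            self.first_memory.append(self.second_memory[address])
            fetched += 1
        return fetched

    def read(self):
        # The data processing section consumes the oldest prefetched word.
        return self.first_memory.popleft()


second_memory = list(range(100, 110))
cache = PrefetchingCache(second_memory, capacity=4)
addresses = iter(range(0, 10, 2))  # a known, strided access pattern
cache.prefetch(addresses)          # fills the four free slots
first_word = cache.read()          # frees one slot
cache.prefetch(addresses)          # refills the freed slot
```

In this model the data processing side only calls `read`, so it spends no time computing second addresses; that work belongs entirely to `prefetch`.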

[0016] In addition, it is preferable for the second address outputting section to be capable of operating asynchronously with, which is to say independently of, the data processing section and/or the first address outputting section. By doing so, data can be prefetched by parallel processing independently of the data processing section. To make it possible to process inputs and outputs for the second memory independently and in parallel, it is preferable to provide the first memory with a plurality of storing sections, such as a plurality of memory banks, for which inputs and outputs can be performed asynchronously or independently.

[0017] It is also possible to configure the second address outputting section so as to output the second address based on data in the first memory, by the second address outputting section alone or by a combination of the second address outputting section and the data processing section. By this configuration, data processing can be executed by indirect addressing with no limitations whatsoever.
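As a rough illustration of the indirect addressing in paragraph [0017], the sketch below derives second addresses from values already held in the first memory. The function names and the optional `base` offset are assumptions made for this example only.

```python
def indirect_second_addresses(first_memory, index_addresses, base=0):
    """Yield second addresses computed from data held in the first memory.

    Hypothetical helper: each value read from the first memory is treated
    as an offset into the second memory, added to an assumed `base`.
    """
    for first_address in index_addresses:
        yield base + first_memory[first_address]


def gather(second_memory, first_memory, index_addresses, base=0):
    """Read the second memory at every indirectly generated address."""
    return [
        second_memory[second_address]
        for second_address in indirect_second_addresses(
            first_memory, index_addresses, base
        )
    ]


index_table = [3, 0, 2]           # data previously loaded into the first memory
external = [10, 11, 12, 13]       # models the second memory
values = gather(external, index_table, range(len(index_table)))
```

Here the address stream depends on data rather than on a fixed pattern, which is the case a purely predictive prefetcher cannot cover.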

[0018] It is preferable for the first memory that operates as a cache to include a first input memory that stores data to be inputted into the data processing section and a first output memory that stores data that has been outputted from the data processing section. By doing so, inputs and outputs of data for the data flows formed in the data processing section can be controlled independently. An address in the first memory is outputted from the first address outputting section, but when there is no space for storing the data corresponding to the first address or there is no data corresponding to the first address in the first memory, a failure may occur in the processing of a data flow formed in the data processing section. For this reason, it is preferable to provide a first arbitrating unit that manages inputs and/or outputs between the first memory and the data processing section.

[0019] The first arbitrating unit can be provided with a function that outputs a stop signal to the data processing section when the conditions for input into or output from the data processing section are not satisfied, such as when there is no data corresponding to the first address or when there is no space for storing data corresponding to the first address. The data processing section can also be provided with a function for stopping the processing of at least one data path or data flow that is configured in the data processing section according to the stop signal, so that the data path or data flow can be turned on and off by the first arbitrating unit. This makes it easy to realize control in which a data path or data flow formed in the data processing section operates only after waiting until the data to be processed has been prepared.
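The stop-signal handshake of paragraph [0019] can be sketched in software as follows. This is a simplified model under assumptions: `InputArbiter`, `DataFlow`, and the `(stop, data)` return convention are hypothetical names and choices, and only the "no data for the first address" condition is modeled.

```python
class InputArbiter:
    """Toy first arbitrating unit that tracks which first addresses hold data."""

    def __init__(self):
        self.stored = {}  # first address -> data currently in the first memory

    def write(self, first_address, data):
        # Models data arriving in the first memory from the second memory.
        self.stored[first_address] = data

    def request(self, first_address):
        """Return (stop, data); stop is True when no data exists for the address."""
        if first_address not in self.stored:
            return True, None
        return False, self.stored[first_address]


class DataFlow:
    """Toy data flow that advances only while the arbiter does not assert stop."""

    def __init__(self, arbiter, first_addresses, operation):
        self.arbiter = arbiter
        self.first_addresses = list(first_addresses)
        self.operation = operation
        self.position = 0
        self.results = []

    def step(self):
        """Try to process one word; return False if finished or in a wait state."""
        if self.position >= len(self.first_addresses):
            return False
        stop, data = self.arbiter.request(self.first_addresses[self.position])
        if stop:
            return False  # wait state: retry the same first address later
        self.results.append(self.operation(data))
        self.position += 1
        return True


arbiter = InputArbiter()
flow = DataFlow(arbiter, [0, 1], lambda word: word * 2)
flow.step()            # stalls: nothing stored at first address 0 yet
arbiter.write(0, 5)
flow.step()            # proceeds once the data is present
```

Because a stalled `step` leaves `position` unchanged, the data flow resumes at the same first address once the data has been prepared.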

[0020] If the first memory includes a first input memory and a first output memory, it is preferable to provide, as the first arbitrating unit, a first input arbitrating unit that manages data transfers from the first input memory to the data processing section and a first output arbitrating unit that manages data transfers from the data processing section to the first output memory. This makes it possible to control data flows formed in the data processing section independently from both the input side and the output side.

[0021] When the first memory includes a plurality of storing sections that are capable of independent inputs and outputs, the first arbitrating unit can be provided with a function that manages the plurality of storing sections independently. In this case, each of the plurality of data flows formed in the data processing section can be controlled independently by the first arbitrating unit according to the state of the corresponding storing section. On the other hand, the first arbitrating unit can be provided with a function that manages a plurality of storing sections relationally or with the storing sections being associated with one another. By doing so, it is easy to realize control in which data flows formed in the data processing section give priority to processing data that is inputted into a predetermined storing section from an external memory, and in which outputs from data flows are outputted with priority to the external memory via a predetermined storing section.

[0022] In addition, when a plurality of data flows can be configured in the data processing section, it is preferable to provide a plurality of first memories and to have a pair of first and second address outputting sections configured in the processing unit corresponding to each first memory. It becomes possible to construct a multilevel or hierarchical cache by appropriately configuring the data processing section and the first address outputting section. Also, depending on the program executed by the integrated circuit device, a plurality of first memories can be divided and used as an instruction cache and a data cache, and when a plurality of data processing sections are provided, the plurality of first memories can be used for caching the data processed by these data processing sections and the data cached by the respective first memories can be appropriately controlled by the second address outputting section.

[0023] When a plurality of second address outputting sections are provided, a second arbitrating unit that manages inputs and outputs between the second memory and the plurality of first memories should preferably be provided and the second address should preferably be supplied to the second arbitrating unit. When the second memory is an external memory, the integrated circuit device of the present invention can access the external memory in the same way as a conventional integrated circuit device. Also, in an integrated circuit device where the second memory is formed on the same chip, it is possible to construct the cache memory hierarchically by providing a third address outputting means that outputs a third address of the data that is inputted and/or outputted between a third memory and the second memory so as to make it possible to input and/or output data between the second memory and the third memory. This is to say, if the third memory is an external memory, the cache memory can be composed of the first and second memories. This third address outputting means may be a conventional cache control mechanism such as an MMU, though it is also possible for the third address outputting means to have a similar construction to the second address outputting section. This is also the case when control is performed for a fourth or higher level of memory (which is not restricted to ROM and RAM and may include various types of storage media such as disks).

[0024] A processing unit in which the data flow can be changed or reconfigured may be a type of processing unit that includes a plurality of logic elements of the same type whose functions can be changed and a set of wires for connecting these logic elements, which is the FPGA described above, or another type of processing unit in which the data path arrangement or data flows can be changed using medium-scale basic functional units of the same construction. It is also possible to use a further different type of processing unit that includes (i) a plurality of types of special-purpose processing elements, each type of special-purpose processing element including internal data paths suited to respectively different special-purpose processing, and (ii) sets of wires for connecting these special-purpose processing elements. With this type of reconfigurable processing unit, it is possible to incorporate special-purpose processing elements including internal data paths that are suited to outputting addresses, so that the processing efficiency for generating addresses is increased and the processing speed can be further improved. Also, since there is a reduction in the number of surplus circuit elements, a reduction can be made in the number of elements that are selected to change the data flow, the AC characteristics can be improved, and the space efficiency is also increased.

[0025] Accordingly, by having a control unit, which indicates changes to at least part of a data flow in the processing unit, execute a process that instructs the processing unit to construct the data processing section, first address outputting section, and second address outputting section mentioned above, a data flow can be flexibly and dynamically changed in a short time. This makes it possible to provide a compact, economical integrated circuit device that includes a flexible cache system.

[0026] To facilitate changes in the data flows in the processing unit, it should preferably be possible not only to change the connections between the special-purpose processing elements but also to include (i) means that select parts of the internal data paths of the special-purpose processing elements and (ii) configuration memories that store selections of the internal data paths. The control unit can reconfigure data flows by rewriting the content of the configuration memories or by indicating changes to at least part of a data flow in the processing unit. If the processing unit includes special-purpose processing elements, the control unit can indicate changes in the data flow in the data processing section, the first address outputting section, or the second address outputting section asynchronously and independently. While data is being inputted into or outputted from the first memory, the special-purpose processing elements that compose the data processing section and/or first address outputting section can be used to configure a data flow for another purpose. Conversely, while processing is being executed by the data processing section, the special-purpose processing elements of the second address outputting section can be used to control a different memory or be used for a different purpose, so that the resources of the processing unit can be flexibly and efficiently utilized.

[0027] By incorporating a code memory for storing program code that has the control unit perform the above processing, it becomes possible to construct an integrated circuit device, such as a single-chip system LSI. Accordingly, it becomes possible to provide integrated circuit devices with improved execution speed where a cache or caches are used efficiently for a variety of types of software without causing overheads. It is also possible to provide a processing unit whose data flows can be reconfigured as a separate chip, as a processor core, or as a chip in which the first memory used as the cache memory is also incorporated. In this way, the present invention can be embodied in a variety of ways, with processing devices that correspond to such embodiments also being included within the scope of the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS

[0028] FIG. 1 is a block diagram showing an arrangement of an integrated circuit device according to an embodiment of the present invention.

[0029] FIG. 2 shows an arrangement of an AAP that is a processing unit.

[0030] FIG. 3 shows an arrangement of the matrix unit.

[0031] FIG. 4 shows an example of a data path portion that is suited to processing that outputs addresses.

[0032] FIG. 5 is a diagram showing the construction of the address generator of the data path portion shown in FIG. 4.

[0033] FIG. 6 is a diagram showing the construction of the counter shown in FIG. 5.

[0034] FIG. 7 is a diagram showing an arrangement of a different address generator to FIG. 5.

[0035] FIG. 8 is a diagram showing how a large-capacity RAM is controlled as an external memory.

[0036] FIG. 9 is a diagram showing how a large-capacity RAM and a peripheral device are controlled as an external memory.

[0037] FIG. 10 is a diagram showing how a plurality of large-capacity RAMs and peripheral devices are controlled as an external memory.

[0038] FIG. 11 is a diagram showing how a large-capacity RAM is controlled as an external memory by a different integrated circuit device according to the present invention.

BEST MODE FOR CARRYING OUT THE PRESENT INVENTION

[0039] The following describes the present invention with reference to the attached drawings. FIG. 1 shows the outline configuration of a system LSI 10 according to the present invention. This LSI 10 is a data processing system that includes a processor unit 11, an AAP (Adoptive Application Processor) portion or unit (hereinafter AAP) 20, an interrupt control unit 12, a clock generating unit 13, an FPGA unit 14, and a bus control unit 15. The processor unit 11 (hereinafter “basic processor” or “processor”) has a general-purpose construction and performs general purpose processing, including error handling, based on instruction sets that are provided by a program or the like. In the AAP unit 20, data flows or virtual data flows that are suited to special-purpose data processing are variably formed by a plurality of operation or logical elements that are arranged in a matrix. The interrupt control unit 12 controls interrupt handling for interrupts from the AAP 20. The clock generating unit 13 supplies an operation clock signal to the AAP 20. The FPGA unit 14 further improves the flexibility of the operation circuits that can be realized by the LSI 10. The bus control unit 15 controls inputs and outputs of data to and from the periphery. The FPGA unit 14 is an interface for an FPGA chip that is disposed in the periphery of the LSI 10 and is referred to hereinafter as the “offchip FPGA” or the “FPGA”. In the LSI 10 that is the integrated circuit device of the present invention, the basic processor 11 and the AAP 20 are connected by a data bus 17 on which data can be exchanged between the basic processor 11 and the AAP 20 and an instruction bus 18 for enabling the basic processor 11 to control the configuration and operation of the AAP 20. Also, interrupt signals are supplied from the AAP 20 to the interrupt control unit 12 via a signal line 19, and when the processing of the AAP 20 has ended or an error has occurred during such processing, the state of the AAP 20 is fed back to the basic processor 11.

[0040] The AAP 20 and the FPGA 14 are connected by a data bus 21, so that data is supplied from the AAP 20 to the FPGA 14, where processing is performed, and the result is then returned to the AAP 20. Also, the AAP 20 is connected to the bus control unit 15 by a load bus 22 and a store bus 23, and so can exchange data with a data bus on the outside of the LSI 10. Accordingly, the AAP 20 can receive an input of data from an external DRAM 2 or another device and output a result produced by processing this data in the AAP 20 back to the external device. The basic processor 11 can also input and output data to and from an external device via a data bus 11a and the bus control unit 15.

[0041] FIG. 2 shows an outline of the AAP unit 20. The AAP unit 20 of the present embodiment comprises a matrix unit or portion 28 in which a plurality of logical blocks, logical units, and/or logical elements (hereinafter “elements”) that perform arithmetical and/or logical operations are arranged in a matrix, an input buffer 26 that supplies data to the matrix unit 28, and an output buffer 27 that stores data that has been outputted from the matrix unit 28. The input buffer 26 and output buffer 27 respectively comprise four small-capacity input memories (RAMs) 26a to 26d and four output memories (RAMs) 27a to 27d. The AAP 20 further comprises an external access arbitrating unit (second arbitrating unit) 25 that controls data input/output operations between (i) the bus control unit 15 and (ii) the input buffer 26 and output buffer 27 that comprise a plurality of memories.

[0042] The input RAMs 26a to 26d and output RAMs 27a to 27d of the present embodiment each function as a 1 Kbyte dual-port RAM, and each can be used as dual-bank RAMs 81 and 82 that are 64 k bits wide and 512 bytes deep. Accordingly, by using different banks for inputs and outputs for the memory, it is possible to process input and output operations independently. An arbitrating unit 85 (first arbitrating unit) that manages inputs into and outputs from the RAMs 81 and 82 is also provided, and it is possible to check whether each bank is full or empty by counting the number of inputs and outputs.
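The full/empty check by counting inputs and outputs, as mentioned in paragraph [0042], might be modeled as in the sketch below. The class name `BankCounter`, its methods, and the `depth` parameter (number of entries per bank) are assumptions for illustration and do not describe the actual arbitrating unit 85.

```python
class BankCounter:
    """Toy full/empty tracker for one bank, based only on access counts."""

    def __init__(self, depth):
        self.depth = depth   # assumed number of entries the bank can hold
        self.writes = 0      # total inputs into the bank
        self.reads = 0       # total outputs from the bank

    def occupancy(self):
        # Entries currently held: everything written and not yet read.
        return self.writes - self.reads

    def is_full(self):
        return self.occupancy() == self.depth

    def is_empty(self):
        return self.occupancy() == 0

    def try_write(self):
        """Count one input; refuse (return False) when the bank is full."""
        if self.is_full():
            return False
        self.writes += 1
        return True

    def try_read(self):
        """Count one output; refuse (return False) when the bank is empty."""
        if self.is_empty():
            return False
        self.reads += 1
        return True


bank = BankCounter(depth=2)
bank.try_write()
bank.try_write()
blocked = not bank.try_write()  # third write is refused: the bank is full
```

Only two counters are needed per bank, which keeps the check independent of the data actually stored.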

[0043] In order to control the inputting and outputting of data into the input RAMs 26a to 26d and out of the output RAMs 27a to 27d, a plurality of types of control signals are exchanged between (i) the matrix unit 28 and (ii) the RAMs and the arbitrating unit 85. First, 16-bit input readout address data (“ira” or the “first address”) 61 for controlling the data that is read out by the matrix unit 28 from the input RAMs 26a to 26d is outputted to each of the input RAMs 26a to 26d. The input readout address 61 is a logical or physical address in the input RAMs 26a to 26d. Also, an input readout address stop signal (“ira_stop”) 62 for controlling the supplying of the address data 61 depending on the full and/or empty states is outputted from the arbitrating unit 85 of each of the input RAMs 26a to 26d to the matrix unit 28. The input readout address stop signal 62 is also outputted from the arbitrating unit 85 when the input conditions for the matrix unit 28 are not ready, such as when there is no data corresponding to the address data 61 supplied from the matrix unit 28.

[0044] In the matrix unit 28, a data flow or data flows that are formed in the matrix unit 28 are turned on and off by the stop signals 62. Accordingly, in the execution process that is performed after the data flows have been configured in the matrix unit 28, the execution of the processing defined by the data flows can be respectively controlled by the arbitrating units 85 of the input RAMs 26a to 26d. If the data that corresponds to the input readout address data 61 is not present in the input RAM 26, the processing of the data flow is placed into a wait state. Conversely, if the data that corresponds to the input readout address data 61 is present in the input RAM 26, 32-bit input readout data (“ird”) 63 is supplied to the matrix unit 28, is processed by the configured data flow, and is outputted to one of the output RAMs 27. Also, a stop signal (“ird_stop”) 64 that controls the input readout data 63 is outputted from the matrix unit 28 to each of the input RAMs 26a to 26d so that the reading out of data is stopped when the operation of the data flow in the matrix unit 28 has stopped due to a cause on the output side, for example.

[0045] The arbitrating unit 85 of each of the input RAMs 26 a to 26 d fundamentally controls each of the input RAMs 26 a to 26 d independently. Accordingly, the exchanging of data between the matrix unit 28 and the input RAMs 26 a to 26 d is controlled and executed separately for each of the input RAMs 26 a to 26 d, so that data flows that are formed in the matrix unit 28 corresponding to the input RAMs 26 a to 26 d are controlled independently. This is also the case for the output RAMs 27 a to 27 d that are described below. On the other hand, the arbitrating units 85 of the input RAMs 26 a to 26 d can be connected by wiring between the input RAMs 26 a to 26 d or by wiring via the matrix unit 28, so that a plurality of input RAMs 26 a to 26 d can be managed relationally or associated with one another. By managing input RAMs 26 a to 26 d relationally, it becomes possible to assign a plurality of input RAMs to a data flow configured in the matrix unit 28. By attaching an order of priority to the plurality of input RAMs 26 a to 26 d using the arbitrating units 85, it is also possible to perform control that supplies data flows with data from RAMs with high priority.

[0046] Also, 32-bit input write address data (“iwa” or the “second address”) 65, which controls the data to be read out from an external memory 2 via the bus control unit 15 and written in each of the input RAMs 26 a to 26 d, and a 4-bit control signal (“iwd_type”) 66, which can indicate the data type, etc., of the input data, are outputted from the matrix unit 28 separately for each of the input RAMs 26 a to 26 d. The input write address data 65 and the control signals 66 that correspond to the respective input RAMs 26 a to 26 d are all outputted to the external access arbitrating unit 25. The input write address data 65 is a physical address in the RAM 2, which is an external memory, or a logical or virtual address that corresponds to the physical address in the RAM 2. In response to these addresses, stop signals (“iwa_stop”) 67, each of which controls the output of the address data 65, are supplied from the external access arbitrating unit 25 to the matrix unit 28.

[0047] Furthermore, 64-bit input write data (“iwd”) 68 that corresponds to the input write address data 65 supplied to the external access arbitrating unit 25 is respectively supplied from the arbitrating unit 25 to each of the input RAMs 26 a to 26 d, and a stop signal (“iwd_stop”) 69 that controls the input write data 68 is supplied from each of the input RAMs 26 a to 26 d to the external access arbitrating unit 25.

[0048] In order to control outputs from the matrix unit 28, 16-bit output write address data (“owa” or the “first address”) 71 for controlling data that is read out from the matrix unit 28 and written in each of the output RAMs 27 a to 27 d is outputted to each of the output RAMs 27 a to 27 d. This output write address data 71 is a logical or physical address in each of the output RAMs 27 a to 27 d. An output write address stop signal (“owa_stop”) 72, which controls the supplying of the address data 71 based on full and/or empty states, is outputted from the arbitrating unit 85 of each of the output RAMs 27 a to 27 d to the matrix unit 28. That is to say, when the conditions for the reception of an output from the matrix unit 28 are not satisfied, the output write address stop signal 72 is outputted from the arbitrating unit 85. In the matrix unit 28, the data flows that are configured in the matrix unit 28 are turned on and off by the stop signals 72, thereby controlling the execution of the processing defined by the data flows. If there is space in the output RAM 27, 32-bit output write data (“owd”) 73 is outputted from the matrix unit 28 together with the output write address data 71. A stop signal (“owd_stop”) 74 that controls the output write data 73 is supplied from the arbitrating unit 85 of each of the output RAMs 27 a to 27 d to the matrix unit 28.

[0049] Also, 32-bit output readout address data (“ora” or the “second address”) 75 for controlling data to be read out from each of the output RAMs 27 a to 27 d via the bus control unit 15 and written into the external memory 2 and a 4-bit control signal (“ord_type”) 76 that can indicate the data type, etc., of this data are outputted from the matrix unit 28 separately for each of the output RAMs 27 a to 27 d. The output readout address data 75 and the control signals 76 are all outputted to the external access arbitrating unit 25. The output readout address data 75 is a physical address in the DRAM 2, which is an external memory, or a logical or virtual address that corresponds to the physical address in the DRAM 2. In response to this, a stop signal (“ora_stop”) 77 that controls the outputting of the address data 75 is supplied to the matrix unit 28 from the external access arbitrating unit 25.

[0050] Furthermore, 64-bit output readout data (“ord”) 78 is supplied together with the output readout address data 75 from each of the output RAMs 27 a to 27 d to the external access arbitrating unit 25, and a stop signal (“ord_stop”) 79, which controls the output readout data 78, is supplied from the external access arbitrating unit 25 to each of the output RAMs 27 a to 27 d.

[0051] With the AAP unit 20 of the present embodiment, the input data 63 of the matrix unit 28 is supplied from the bus control unit 15, which is the interface for the external memory 2, via the plurality of input RAMs 26 a to 26 d and the external access arbitrating unit 25. Also, the output data 73 from the matrix unit 28 is supplied to the bus control unit 15, which is the interface for the external memory 2, via the plurality of output RAMs 27 a to 27 d and the external access arbitrating unit 25. The input RAMs 26 a to 26 d and the output RAMs 27 a to 27 d each have a dual-bank construction, so that (a) the processing between the input RAMs 26 a to 26 d, the output RAMs 27 a to 27 d, and the matrix unit 28, and (b) the processing between the input RAMs 26 a to 26 d, the output RAMs 27 a to 27 d, and the external access arbitrating unit 25, which is to say, the processing that involves the external RAM 2, can be executed independently and asynchronously in parallel.
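
The benefit of the dual-bank construction can be illustrated with the following minimal sketch. The class DualBankRAM and its methods are hypothetical, and the explicit swap() call is an assumption made for illustration; the embodiment does not state how or when the banks exchange roles.

```python
class DualBankRAM:
    """Hypothetical model of a dual-bank RAM such as the input RAM 26 a."""

    def __init__(self, bank_words):
        self.banks = [[None] * bank_words, [None] * bank_words]
        self.foreground = 0  # index of the bank visible to the matrix unit

    def background_write(self, offset, word):
        """Write from the external memory side into the hidden bank."""
        self.banks[1 - self.foreground][offset] = word

    def foreground_read(self, offset):
        """Read from the matrix unit side out of the visible bank."""
        return self.banks[self.foreground][offset]

    def swap(self):
        """Exchange the roles of the two banks (assumed mechanism)."""
        self.foreground = 1 - self.foreground
```

Because the two sides never touch the same bank at the same time in this model, a block load from the external RAM 2 does not disturb the reads made by the data flow, which is the independence that the paragraph above describes.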

[0052] Between the external access arbitrating unit 25 and the bus control unit 15, the load bus 22 and the store bus 23, each comprising a 32-bit address bus and a 256-bit data bus, are arranged so that data can be inputted and outputted at high speed in block units. The input address signal 22 a and the output address signal 23 a are transmitted via the address bus, and the input data 22 b and the output data 23 b are transmitted via the data bus. Signal lines that transmit the 5-bit commands 22 c and 23 c, signal lines that transmit busy signals 22 d and 23 d of the bus control unit 15, and a signal line that transmits a ready signal 22 e of the bus control unit 15 are also provided.

[0053] FIG. 3 shows an arrangement of a partial configuration 29 of the AAP 20 comprising the matrix unit 28 and the small-capacity RAMs 26 a to 26 d and 27 a to 27 d of the present embodiment. In the present invention, the matrix unit 28 is a system corresponding to the processing unit in which data paths or data flows are reconfigurable or changeable. The matrix unit 28 comprises a plurality of elements 30 that are operation units, with these elements 30 being arranged in an array or matrix so as to form four lines in the vertical direction. Between these elements 30, the matrix unit 28 also comprises row wire sets 51 that extend in the horizontal direction and column wire sets 52 that extend in the vertical direction. The column wire sets 52 include a pair of wire sets 52 x and 52 y that are composed of the wires in the column direction on the left and right sides, respectively, of the operation units 30, with data being supplied to the individual elements 30 by these wire sets 52 x and 52 y.

[0054] Switching units 55 are disposed at intersections between the row wire sets 51 and the column wire sets 52, with each switching unit 55 being able to switch and connect any of the channels of the row wire set 51 to any of the channels of a column wire set 52. Each switching unit 55 comprises a configuration RAM that stores settings, and by having the content of the configuration RAM rewritten according to data supplied from the processor unit 11, the connections between the row wire set 51 and the column wire set 52 can be dynamically controlled as desired. Accordingly, in the matrix unit 28 of the present embodiment, a configuration of at least one data flow that is formed of all or part of the plurality of elements 30 by connecting the wire sets 51 and 52 can be dynamically changed as desired.

[0055] Each element 30 comprises a pair of selectors 31 that respectively select input data from the pair of column wire sets 52 x and 52 y and an internal data path 32 that performs a specified arithmetic and/or logical operation process on the selected input data “dix” and “diy” and outputs output data “do” to the row wire set 51. Elements 30 with internal data paths that execute different processes are arranged on different rows in the matrix unit 28 of the present embodiment. The row wire sets 51 and column wire sets 52 also comprise wires for transferring carry signals. The carry signals can be used as signals that show a carry or as signals that show true or false, and in the matrix unit 28, these carry signals are used for controlling the arithmetic operations and logic operations of each element 30 and for transferring results to other elements 30.

[0056] First, the elements 30 that are arranged on the first row comprise data path units 32 i that are suited to processing that receives data from the input buffer 26. If these data path units (“LD”) 32 i for load operations simply receive an input of data, logic gates are not required, and data is simply received via the load bus 22 and is outputted to the row wire set 51. In the matrix unit 28, the data path units 32 i for load operations each have a function for stopping the processing of the data flow to which the element 30 including this data path unit 32 i is connected when the stop signal 62 is received from the RAM arbitrating unit 85 of the input RAM 26. The data path units 32 i for load operations also each have a function for outputting the stop signal 64 to the arbitrating unit 85 of the corresponding input RAM 26 when the data flow to which the element 30 including the data path unit 32 i is connected stops due to an internal factor in the matrix unit 28 or an output-side factor.

[0057] The elements 30 a that are arranged on the second row are elements for writing data from the external RAM 2 into the input RAMs 26 a to 26 d of the input buffer 26, and correspond to the second address outputting sections. Accordingly, these elements 30 a each comprise a data path portion or unit 32 a with an internal data path that is suited to generating an address (second address) for block loading. Such data path units 32 a are called BLAs (Background Load Address Generators). FIG. 4 shows an example of the data path unit 32 a that comprises an address generator 38 composed of a counter, etc., with an address being outputted from this address generator 38 as the output signal “do”. The output signal “do” is supplied via the row wire set 51 and the column wire set 52 as it is or after processing by other elements 30 to a data path unit 32 as the input signal “dix” or “diy”, one of the supplied addresses is selected by a selector “SEL”, and outputted via a flip-flop “FF” from the matrix unit 28 to the external access arbitrating unit 25 as the input write address data 65.

[0058] Like all of the elements 30 that compose the matrix unit 28, the elements 30 that generate these addresses comprise a configuration RAM 39 for setting conditions of an address generator 38 and selector SEL. The data in the configuration RAM 39 is set by a control signal 18 from the basic processor 11.

[0059] FIG. 5 shows one example of the address generating circuit 38. This address generator 38 comprises a plurality of counters 38 a and an adder 38 b that performs some operations on the outputs of these counters 38 a and outputs the result as an address. As shown in FIG. 6, each of the counters 38 a comprises a combination of an arithmetic logic unit ALU 38 c and a comparator 38 d, with it being possible to set an ADD, SUB, bit shift, OR, XOR, or a combination of these operations in the ALU 38 c. The counters 38 a each have a function as a function generating circuit that generates a value every time the clock signal rises. The functions of the counters 38 a can be set by the processor unit 11 via the configuration RAM 39.

[0060] The control signal “en” of the ALU 38 c can be set by a carry signal “cy” supplied from another counter 38 a and the output of the comparator 38 d can be transmitted to another counter 38 a as the carry signal “cy”. By using the carry signal in this way, the state of another counter 38 a can be set according to the state of a counter 38 a and a desired address can be generated. Also, though not shown in the drawing, the control signal “en” of the counter 38 a can be set according to the carry signal “cy” supplied from another element 30 and can be transmitted to another element 30.
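
The chaining of counters by carry signals can be illustrated with the following software sketch. It is only a model under stated assumptions: the class Counter, its fields, and the function generate_addresses are hypothetical names, the ALU is restricted to an ADD with a fixed step, and the comparator is assumed to assert the carry and reset the counter when a limit is reached.

```python
from dataclasses import dataclass


@dataclass
class Counter:
    """Hypothetical model of one counter 38 a: an ALU step plus a comparator."""
    start: int
    step: int
    limit: int      # comparator threshold (exclusive upper bound)
    value: int = 0

    def __post_init__(self):
        self.value = self.start

    def tick(self, en):
        """Advance by one clock when enabled; return the carry "cy"."""
        if not en:
            return False
        self.value += self.step          # ALU configured as ADD
        if self.value >= self.limit:     # comparator fires
            self.value = self.start
            return True
        return False


def generate_addresses(base, inner, outer, count):
    """Sum the counter outputs (the role of adder 38 b) once per clock."""
    addresses = []
    for _ in range(count):
        addresses.append(base + outer.value + inner.value)
        cy = inner.tick(True)   # inner counter runs every clock
        outer.tick(cy)          # outer counter is enabled by the carry
    return addresses
```

For example, generate_addresses(0x100, Counter(0, 1, 4), Counter(0, 16, 64), 8) walks four consecutive words and then jumps to the next row of a region whose rows are 16 words apart, which is the kind of two-dimensional access pattern that chained counters make possible.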

[0061] The element (BLA) 30 a that outputs the input write address data 65 has a construction of the data path unit 32 a including an address generating circuit 38 that is suited to the generation of addresses, with it being possible to control the processing content of the address generation from the processor 11 via the configuration RAM 39. It is also possible to freely set how the element (BLA) 30 a is related to the other elements 30. The plurality of counters 38 a that are included in the BLA 32 a are 32-bit counters, for example, and can generate an address for DMA transfer from the external memory 2 to the input RAMs 26 a to 26 d that are local store buffers.

[0062] The elements 30 b arranged on the third row in FIG. 3 comprise data path units 32 b that generate input readout addresses 61 for loading desired data from each of the input RAMs 26 a to 26 d into the matrix unit 28, and correspond to the first address outputting sections. The data path unit 32 b is called an LDA (Load Address Generator). The construction of these data path units 32 b is fundamentally the same as the construction of the data path units 32 a described above that generate addresses, except that the data path units 32 b output 16-bit addresses, not 32-bit addresses like the data path units 32 a. Accordingly, the fundamental configuration of the data path units 32 b is as shown in FIG. 4.

[0063] One example of the address generating circuit 38 included in each LDA 32 b is shown in FIG. 7. This address generator 38 comprises four 16-bit counters 38 a and generates an address for transferring data from the input RAMs 26 a to 26 d, which are the local store buffers, to the matrix unit 28. The control signal “en” of the counter 38 a can be set by the carry signal “cy” supplied from another element 30 and is constructed so that the control signal “en” can be transmitted to another element 30. Data is supplied from the input RAMs 26 a to 26 d to the matrix unit 28 according to the input readout address data 61 outputted from this element 30, with this data being processed by other logic or operation elements that compose the matrix unit 28.

[0064] The elements 30 c that are arranged on the fourth and fifth rows comprise data path units (“SMA”) 32 c that are suited to arithmetic operations and logic operations. As one example, these data path units 32 c comprise a shift circuit, a mask circuit, an ALU and a configuration RAM 39 for setting the operation to be executed by the ALU. Accordingly, the input data “dix” and “diy” can be subjected to operations such as addition, subtraction, a comparison, a logical AND or a logical OR according to an instruction written by the processor 11, with the result being outputted as the output data “do”.

[0065] The elements 30 d that are arranged on the next row down comprise data path units (“DEL”) 32 d that are suited to processing that delays the timing at which data is transferred. As one example, a data path composed of a combination of a plurality of selectors and flip-flops FF is provided in these data path units 32 d, and by having the input data “dix” and “diy” take a path that is selected by the selectors according to the data in the configuration RAM 39, the input data “dix” and “diy” are delayed by a desired number of clocks and then outputted as output signals “dox” and “doy”.
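
The delay processing can be modeled in software as a chain of flip-flops whose length is fixed by a configuration value. In the following sketch the class DelayElement and its clocks parameter are hypothetical; a single input is shown, and the selector network is reduced to the choice of chain length.

```python
from collections import deque


class DelayElement:
    """Hypothetical model of a "DEL" data path for one input signal."""

    def __init__(self, clocks):
        # One entry per flip-flop stage; None stands for "no data yet".
        self.pipe = deque([None] * clocks, maxlen=clocks)

    def clock(self, di):
        """Shift in one word and return the word delayed by `clocks` clocks."""
        if not self.pipe:           # zero stages selected: pass through
            return di
        do = self.pipe[0]           # oldest word leaves the chain
        self.pipe.append(di)        # maxlen drops the oldest entry
        return do
```

With DelayElement(2), the words supplied on successive clocks reappear two clocks later, which allows two branches of a data flow with different latencies to be brought back into step.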

[0066] The elements 30 e that are arranged on the next row down comprise data path units (“MUL”) 32 e that comprise multipliers or the like and are suited to multiplication. Elements that comprise data path units 32 f for an interface with the FPGA unit 14 that is provided on the outside of the matrix unit 28 are also provided as another kind or type of elements 30 f, with these elements 30 f being able to continuously perform processing that supplies data to the FPGA unit 14 and returns the data to the matrix unit 28 after processing.

[0067] Elements 30 g and 30 h that respectively comprise data path units 32 g and 32 h that are suited to generating store addresses are arranged further below the region corresponding to the data processing section in which the above types of elements are arranged. These data path units 32 g and 32 h have fundamentally the same construction as the data path units 32 b and 32 a respectively that generate addresses and were described above with reference to FIGS. 4 to 7. The elements 30 g that comprise the data path units 32 g are the first address outputting sections and output the output write addresses 71 for writing data outputted from the matrix unit 28 into the output RAMs 27 a to 27 d. Therefore, the data outputted from the data processing systems using the various types of elements 30 c to 30 f that are described above is written into the output RAMs 27 a to 27 d. Each data path unit 32 g is called an STA (Store Address Generator) and has the same configuration as the LDA 32 b.

[0068] The elements 30 h that comprise the data path units 32 h and are arranged below these elements (STA) 30 g are the second address outputting sections and output the output readout addresses 75 for reading out data from the output RAMs 27 a to 27 d and writing data into the external RAM 2 so that data processed by the matrix unit 28 is written into the external RAM 2. Each data path unit 32 h is called a BSA (Background Store Address Generator) and has the same construction as the BLA 32 a.

[0069] Elements 30 comprising data path units 32 s that are suited to the outputting of data for storing are arranged on the final row. These data path units 32 s are called “ST”, with it being possible to use data path units with almost the same construction as the data path units 32 c for arithmetic operations. Also, in the present embodiment, each data path unit 32 s for outputting is provided with a function for stopping the processing of the data flow that is connected to the element 30 including the data path unit 32 s when a stop signal 74 is received from the arbitrating unit 85 of the output RAM 27.

[0070] In this way, the matrix unit 28 of the present embodiment comprises elements 30 a with internal data paths (BLA) 32 a that generate addresses for inputs (block loads) of data from the external RAM 2 into the input RAMs 26 a to 26 d and elements 30 b with internal data paths (LDA) 32 b that generate addresses for inputs of data into the matrix unit 28 from these input RAMs 26 a to 26 d. The matrix unit 28 also comprises elements 30 g with internal data paths (STA) 32 g that generate addresses for outputs of data from the matrix unit 28 to the output RAMs 27 a to 27 d and elements 30 h with internal data paths (BSA) 32 h that generate addresses for outputs (block stores) of data in the output RAMs 27 a to 27 d to the external RAM 2. These elements 30 a, 30 b, 30 g, and 30 h each have a data path that is suited to the generation of the addresses mentioned above, with it being possible to change the configurations and functions of the data path by rewriting the data in the configuration RAM 39. The connections with the other elements 30 in the matrix unit 28 can also be changed by changing the connections of the row wire sets 51 and the column wire sets 52. Accordingly, data for address generation can be provided from the processor 11 and/or from other elements 30 in the matrix unit 28 and the timing at which addresses are generated can be flexibly controlled.

[0071] In this way, according to a variety of conditions and/or constructions, data can be loaded from the external RAM 2 into the input RAMs 26 a to 26 d that are used as caches. Separately from this processing, data can also be loaded into the matrix unit 28 asynchronously and/or independently from the input RAMs 26 a to 26 d according to different conditions. In addition, the elements 30 a and 30 b are independent, so that such processing can be executed in parallel. Accordingly, the plurality of input RAMs 26 a to 26 d are storage sections where inputting and outputting can be performed independently.

[0072] Since each of the input RAMs 26 a to 26 d has a dual-bank configuration, inputting and outputting can be performed in parallel for each of the input RAMs 26 a to 26 d, so that with this configuration, the inputting and outputting of data into and out of each of the input RAMs 26 a to 26 d can be performed extremely efficiently. This is also the case for each of the output RAMs 27 a to 27 d, which are also storage sections where inputting and outputting can be performed independently, and inputting and outputting into and from each of the output RAMs 27 a to 27 d can be performed independently and in parallel. Accordingly, in this system, inputs and outputs of data can be performed extremely efficiently for the RAMs 26 a to 26 d and 27 a to 27 d that operate as caches.

[0073] The matrix unit 28 of the present embodiment comprises the elements 30 a, 30 b, 30 g, and 30 h with the data path units 32 a, 32 b, 32 g, and 32 h that are fundamentally suited to the generation of addresses, with the operations of these elements being determined according to instructions from the basic processor 11. That is to say, according to instructions that are supplied via the control bus 18 from the basic processor 11, which is the control unit, the circuit for accessing the RAMs 26 a to 26 d and 27 a to 27 d, which are the first memory, is determined and the circuit for accessing the DRAM 2 that is the main memory (the second memory) is also determined.

[0074] In addition, a circuit for controlling the accesses to these memories is configured in the matrix, so that it is extremely easy to directly or indirectly reflect the conditions on the inside of the matrix unit 28, for example, the configuration of the data flows, the processing results of the data flows, and also the results of processing that uses other elements of the matrix unit 28, in the operation of these circuits. The elements 30 a, 30 b, 30 g, and 30 h are not only suited to the generation of addresses but can also be freely wired to other elements in the matrix unit 28 by the wires 51 and 52 in the same way as the other elements. For this reason, the outputs from the elements 30 a, 30 b, 30 g, and 30 h can be controlled by changing the parameters and/or the processing content of the elements 30 a, 30 b, 30 g, and 30 h according to a data flow or data flows that are configured by the other elements that form the data processing section in the matrix unit 28 and/or the software that is executed by the data processing section. By constructing a data flow using the other elements in addition to the elements 30 a, 30 b, 30 g, and 30 h, the functions of the other elements can also be used for generating addresses. Therefore, the access method for accessing the RAMs 26 a to 26 d and 27 a to 27 d that are the first memory that composes the cache system and the access method for accessing the DRAM 2 that is the main memory (second memory) can be flexibly determined according to conditions on the inside of the matrix unit 28, for example, the construction of the data flows and the processing results.

[0075] The matrix unit 28 is reconfigurable according to control from the basic processor 11, so that the internal data paths and functions of the elements 30 a, 30 b, 30 g, and 30 h that generate addresses can also be dynamically reconfigured and the connections with other elements can also be dynamically reconstructed. It is also possible to provide the function for instructing reconfiguration of the connections within elements or between elements on the inside of the matrix unit 28. When the configurations of data flows or data paths are rearranged by changing the connections with the other elements 30 in the matrix unit 28 according to the processing content executed by the matrix unit 28, it is also possible to change the configurations that input and output data into and out of the buffer 26 composed of the input RAM and the buffer 27 composed of the output RAM.

[0076] For this reason, it is possible to use a configuration that is optimally suited to the processing executed by the matrix unit 28 for the cache system that inputs and outputs data to and from the input buffer 26 and the output buffer 27, so that the hit rate of the cache can be raised, and the frequency of rewrites of data in the cache can be reduced. It is also possible to reconfigure the insides of the elements 30 a, 30 b, 30 g, and 30 h that generate addresses and the data paths related to these elements on an element-by-element basis and to rearrange the cache system separately for each of the RAMs 26 a to 26 d and 27 a to 27 d. This makes the present invention extremely flexible. Accordingly, before a data processing system or systems are configured in the matrix unit 28 from the other elements 30, it is possible to realize a data input configuration that is suited to the data processing system to be configured and commence data loads. On the other hand, after the data processing system has been reconfigured for other processing, the data outputting configuration can be maintained for continuously outputting the data processed by the data processing system that has already been reconfigured. In this way, processing that was inconceivable with conventional techniques can be executed with great flexibility. That is to say, the processing performed for the RAMs 26 and 27 that are the first memory and the DRAM 2 that is the second memory can be executed as desired independently of other elements and data flows or alternatively as part of the processing of other elements or data flows. It is also possible to make the elements 30 a, 30 b, 30 g, and 30 h that generate addresses operate relationally or cooperatively, to make a plurality of elements 30 a and/or 30 b operate relationally or cooperatively, and to have the matrix unit 28 use the plurality of RAMs 26 as a single high-capacity cache.

[0077] Also, it is possible for the element 30 a to perform a process that outputs the input write address 65 and writes data from the RAM 2 when the input RAM 26 a becomes empty, while the element 30 b performs a process that loads data into the matrix unit 28 when there is data in the RAM 26 a. The elements 30 a and 30 b can be made to operate independently and in parallel, so that data in the external RAM 2 can be prefetched into the input RAM 26 a without wasting the processing time of the data processing system. If the element 30 a controls the address at which data is inputted from the external RAM 2, the processing in a data processing system composed of the element 30 b and the matrix unit 28 can proceed with only an address in the internal RAM 26 a. If a data flow-type processing system is defined using a plurality of other elements 30 in the matrix unit 28, data processing can proceed in the matrix unit 28 with only the data and without using an address.

[0078] It is also possible to configure a system in which a virtual address is outputted from a data processing system in the matrix unit 28 and the element 30 b converts this virtual address into a physical address in the input RAM 26 a and supplies data, with the element 30 a converting the virtual or physical address into a physical address in the external RAM 2 and loading the data from the external RAM 2 when the data is not in the input RAM 26 a.
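
One possible division of roles between the elements 30 b and 30 a in such a system is sketched below. The class SoftwareCache and its parameters are hypothetical, and the direct-mapped placement of blocks is an assumption chosen only to keep the example short; the embodiment leaves the mapping to whatever circuit is configured in the matrix unit 28.

```python
class SoftwareCache:
    """Hypothetical model: LDA-style lookup with a BLA-style block load on a miss."""

    def __init__(self, external, block_words, num_blocks, ext_base=0):
        self.external = external          # stands in for the external RAM 2
        self.block_words = block_words
        self.num_blocks = num_blocks
        self.ext_base = ext_base
        self.tags = [None] * num_blocks   # which block each slot holds
        self.ram = [[] for _ in range(num_blocks)]  # stands in for the input RAM 26 a
        self.misses = 0

    def read(self, vaddr):
        block, offset = divmod(vaddr, self.block_words)
        slot = block % self.num_blocks    # assumed direct-mapped placement
        if self.tags[slot] != block:
            # Role of the element 30 a: form the external physical address
            # and load the whole block into the input RAM.
            phys = self.ext_base + block * self.block_words
            self.ram[slot] = self.external[phys:phys + self.block_words]
            self.tags[slot] = block
            self.misses += 1
        # Role of the element 30 b: address within the input RAM.
        return self.ram[slot][offset]
```

The data processing system only ever presents vaddr; both address conversions happen in the two address-generating roles, so either one can be changed without touching the data flow that consumes the data.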

[0079] It is also possible to configure a system where the element (BLA) 30 a generates an address from data inputted from the input RAM 26 b, with this address being used to load data from the external RAM 2 into the input RAM 26 a. Accordingly, completely indirect addressing control can be performed merely by the mechanism that performs inputs and outputs for the input RAM 26 and the output RAM 27, independently of the data processing system constructed in the matrix unit 28. It is also possible to realize a multilevel cache system by linking the operations of the plurality of input RAMs 26 a to 26 d, the output RAMs 27 a to 27 d, and also the access arbitrating unit 25.
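
The indirect addressing described above amounts to a gather operation, as the following sketch shows. The function indirect_block_load and its parameters are hypothetical; index_ram stands in for the data held in the input RAM 26 b and the returned list for the content loaded into the input RAM 26 a.

```python
def indirect_block_load(external, index_ram, block_words):
    """Use each word of index_ram as an external address and load a block from it."""
    loaded = []
    for pointer in index_ram:
        # The pointer read from one input RAM becomes the block-load address.
        loaded.extend(external[pointer:pointer + block_words])
    return loaded
```

For instance, with pointers 6 and 2 and a block length of two words, the words at external addresses 6, 7, 2, and 3 are loaded in that order, without the data processing system ever computing an external address itself.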

[0080] The AAP 20 of the present embodiment is provided with four input RAMs 26 a to 26 d and four output RAMs 27 a to 27 d that correspond to the elements 30 that are arranged in four columns. Accordingly, the input RAMs 26 a to 26 d and the output RAMs 27 a to 27 d can be used as individual cache memories that respectively correspond to the plurality of data processing systems configured with the other kinds of elements 30 in the matrix unit 28. When a plurality of jobs and/or applications are executed by the matrix unit 28, the input RAMs 26 a to 26 d and the output RAMs 27 a to 27 d can be used separately as optimal caches for these jobs and/or applications. The elements 30 are arranged in four columns, though the data processing systems configured with these types of elements 30 are not limited to four. If three or fewer data processing systems are configured in the matrix unit 28, the capacity of the cache memory used by one data processing system can be increased by assigning a plurality of RAMs out of the input RAMs 26 a to 26 d and the output RAMs 27 a to 27 d to one data processing system. When five or more data processing systems are configured, one RAM is assigned to a plurality of data processing systems as a cache memory. In this case, at worst, the same condition may occur as in the cache processing for multitasking that is performed in a modern CPU, in which data processing systems share a RAM.
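
The two cases described above (fewer data processing systems than RAMs, and more data processing systems than RAMs) can be illustrated with a simple assignment routine. The function assign_rams and the round-robin policy are hypothetical; the embodiment does not prescribe how RAMs are distributed.

```python
def assign_rams(num_systems, rams=("26a", "26b", "26c", "26d")):
    """Assign RAMs to data processing systems (assumed round-robin policy)."""
    if num_systems < 1:
        raise ValueError("at least one data processing system is required")
    assignment = {system: [] for system in range(num_systems)}
    if num_systems <= len(rams):
        # Few systems: spread all RAMs, so some systems get a larger cache.
        for i, ram in enumerate(rams):
            assignment[i % num_systems].append(ram)
    else:
        # Many systems: several systems share one RAM as their cache.
        for system in range(num_systems):
            assignment[system].append(rams[system % len(rams)])
    return assignment
```

With three systems, one system receives two RAMs and so a larger cache; with five systems, two systems share the same RAM, which corresponds to the shared-cache condition mentioned at the end of the paragraph above.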

[0081] As shown in outline in FIG. 8, a system LSI 10 that is the integrated circuit device or processing device of the present invention comprises a configuration or assembly 29 including a matrix portion or part that is the processing unit and a small-capacity RAM, with addresses that are outputted to the external RAM 2 from the matrix part being supplied to the external RAM 2 via the arbitrating unit 25. An address generating mechanism that controls the inputting and outputting of data into and out of the small-capacity RAM is realized by the matrix part where data flows can be reconfigured, so that the architecture that controls the small-capacity RAM that functions as a cache memory can also be reconfigured and so can be changed to an optimal construction for the software executed by the matrix unit. Accordingly, with the system LSI 10 that is the integrated circuit device or processing device of the present invention, the small-capacity RAM can be used as a cache memory in the most efficient manner for the hardware environment and the processing content of the software that is to be executed. Even when a variety of software programs are executed, a cache memory and a circuit for controlling this cache memory can be configured so that a higher hit rate is obtained. Accordingly, it is possible to provide an integrated circuit device or processing device (system LSI or ASIC) in which no overloads are caused by inputs into and outputs from the cache memory even when a variety of software is executed.

[0082] The external memory that can be controlled by the system LSI 10, which is the second memory, is not limited to RAM. The device used as the external memory for the input RAM and/or the output RAM is not limited to a storage device such as a RAM, ROM, or even a hard disk drive, and includes any device that can input or output data when an address is indicated. As one example, as shown in FIG. 9, when the LSI 10 controls a large-capacity RAM 2 and a peripheral device 3, such as a printer or a display, as an external memory, the elements BLA 30 a and BSA 30 h that perform block loads for the matrix unit 28 may generate physical addresses that are assigned to the peripheral device 3.

[0083] Also, as shown in FIG. 10, it is possible to provide the LSI 10 that controls a plurality of large-capacity RAMs 2 and peripheral devices 3 via a plurality of bus controllers. In this case, modifications, such as the provision of a plurality of arbitrating units 25, may be applied. Also, a large-capacity RAM 2 may be implemented inside the LSI 10, and it is also possible to use a construction where the large-capacity RAM 2 is used as a cache memory for the peripheral devices 3. The large-capacity RAM 2 may also be used as a code RAM of the processor 11.

[0084] The above explanation describes one example of the construction of the matrix unit or part 28, though the present invention is not limited to this construction. In the above description, operation elements that include the special-purpose data paths 32 suited to special-purpose processing such as address generation, arithmetic operations, logic operations, multiplications, and delays are described as the elements, though the functions of the data paths and their configurations are not limited to the examples given above. By arranging elements including data paths with some functions that are suited to the applications executed by the LSI 10, which is the integrated circuit device or data processing device of the present invention, in a matrix or in an array, it is possible to provide a processing unit in which data flows can be changed or reconfigured. A plurality of matrix units 28 may be implemented or arranged, with the plurality of matrix units being arranged on the same plane or in three dimensions, so that an integrated circuit device comprising an even larger number of elements can be constructed. Also, the integrated circuit device of the present invention is not limited to an electronic circuit and can be adapted to an optical circuit or an optoelectronic circuit.

[0085] While the present invention is described above by means of an example in which an AAP 20, a basic processor 11, and a bus control unit 15 are incorporated in a system LSI 10, the range of the components to be provided as a single chip depends on conditions such as the applications to be implemented. The AAP 20 may also be provided as a single chip, or alternatively the part 29 that includes the RAMs 26 and 27, which form the cache, and the matrix unit 28 may be packaged into a single chip. It is also possible to provide a larger system LSI or ASIC comprising a plurality of AAP units or other special purpose circuits in addition to the basic processor 11.

[0086] As shown in FIG. 11, the integrated circuit device or processing device of the present invention can also be realized by using an FPGA as a processing unit in place of the matrix unit 28 and, in the FPGA, in addition to the data processing section, the first and second address outputting sections of the present invention can be programmed or mapped for using the input RAMs 26 and the output RAMs 27 as caches. An FPGA is an architecture where the configuration of data paths that have wide applicability can be changed at the transistor level. Research is also being performed into integrated circuit devices where the data paths or data flows can be reconfigured using medium-scale basic functional units that are of the same construction, the basic functional units consisting of the same kinds of elements (though not at the transistor level) but executing various processes according to an instruction set. In a processing unit having this kind of architecture, the integrated circuit device and processing device of the present invention can also be realized by configuring (or indicating the configuration of), in addition to a data processing section, a first and second address outputting section that have the input RAM 26 and the output RAM 27 function as caches.

[0087] Unlike the architecture where basic units of the same construction are aligned, an architecture based on the matrix unit described above comprises a plurality of types of elements, each type of element including different internal data paths. Since this is not an architecture that needs to have wide applicability on a transistor level, the packing density can be raised and a compact, economical system can be provided. Also, since each of the elements 30 comprises a data path unit 32 that is dedicated to special-purpose processing, a large reduction can be made in the redundancy in the construction. Compared to an FPGA or another processing unit in which basic processing units of the same construction are arranged, a large increase can be made in processing speed and the AC characteristics can also be improved. Also, since space is used more efficiently, a compact layout can be used, and the lengths of the wires can also be reduced. Accordingly, the architecture including the matrix unit is suited to an integrated circuit device or processing device that makes full use of the efficient cache construction disclosed by the present invention, which makes it possible to provide a low-cost processing device with higher-speed processing.

[0088] Furthermore, unlike an FPGA where circuits are mapped at the transistor level, changing the combination of elements 30 that include the data path units 32, which are suited in advance to special-purpose processing, has the merit that the configurations and functions of the data processing units, that is, the data processing systems configured in the matrix unit 28, can be changed in a short time that in most cases is one clock. Also, in each element 30, the functions of the selectors and logic gates, such as the ALU, that compose the data path unit 32 can be set independently by the processor 11 via the configuration memory 39, so that the data path unit 32 of each element 30 can be flexibly changed within the range of functions that the data path unit provides. Accordingly, in the matrix unit 28 of the present embodiment, the range of functions that can be executed by data flow-type data processing is extremely wide. It is also possible to select and arrange suitable types of operation units 30 for the application, such as network processing or image processing, for which the LSI 10 is to be used, which makes it possible to provide an integrated circuit device with even higher mounting efficiency and processing speed.
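
As a minimal sketch of the idea that the function of a data path unit is selected through a configuration memory, the following software model may be considered. The `Element` class, the `ALU_FUNCTIONS` table, and the encoding of the configuration words are hypothetical; the embodiment does not define an encoding, and a software model of this kind says nothing about hardware timing.

```python
import operator
from typing import Callable, Dict

# Hypothetical encoding of configuration words; the source does not define one.
ALU_FUNCTIONS: Dict[int, Callable[[int, int], int]] = {
    0: operator.add,
    1: operator.sub,
    2: operator.and_,
    3: operator.or_,
}


class Element:
    """Toy element whose data path function is chosen by a configuration word."""

    def __init__(self) -> None:
        self.config_memory = 0  # defaults to addition in this sketch

    def write_config(self, word: int) -> None:
        """Model of a control unit rewriting the configuration memory."""
        if word not in ALU_FUNCTIONS:
            raise ValueError(f"unsupported configuration word: {word}")
        self.config_memory = word

    def clock(self, a: int, b: int) -> int:
        """Apply the currently selected function to one pair of inputs."""
        return ALU_FUNCTIONS[self.config_memory](a, b)


element = Element()
results = []
for word in (0, 1, 2):
    element.write_config(word)  # reconfigure, then process the next inputs
    results.append(element.clock(6, 3))
```

In the sketch, each rewrite of the configuration word takes effect on the very next call to `clock`, and the element can only be changed within the fixed set of functions that its data path offers, which loosely mirrors the range restriction described above.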

[0089] As described above, the present invention forms the first address outputting section and second address outputting section that control a first memory which can be used as a cache memory in a processing unit in which the data flows are changeable. This means that the configuration of the cache system can be dynamically reconfigured to an optimal configuration for the configuration of the data processing section and the software that is executed by the data processing section. When a variety of software is executed, a cache system with a higher hit ratio can be constructed. Accordingly, it is possible to provide an integrated circuit device that executes a variety of software or applications in a shorter processing time.

INDUSTRIAL APPLICABILITY

[0090] The processing unit and integrated circuit device of the present invention can be provided as a system LSI, an ASIC, or the like that can execute a variety of data processing. The processing unit and integrated circuit device of the present invention are not limited to electronic circuits, and may be adapted to optical circuits or optoelectronic circuits. The integrated circuit device of the present invention can execute data processing at high speed using hardware that can be reconfigured, and so is suitable for a data processing device that performs processing, such as network processing and image processing, where high-speed and real-time processing is required.

1. An integrated circuit device, comprising: a first memory for inputting data into and/or outputting data from a second memory; and a processing unit in which at least one data flow is formed and in which at least part of the at least one data flow is changeable, wherein the processing unit includes: a data processing section that processes data that is inputted from and/or outputted to the first memory; a first address outputting section that outputs a first address of data that is inputted and/or outputted between the first memory and the data processing section; and a second address outputting section that outputs a second address of data that is inputted and/or outputted between the first memory and the second memory.
2. An integrated circuit device according to claim 1, wherein the first address is an address in the first memory and the second address is an address in the second memory.
3. An integrated circuit device according to claim 1, wherein the second address outputting section is capable of operating independently of the data processing section and/or the first address outputting section.
4. An integrated circuit device according to claim 1, wherein the first memory includes a plurality of storing sections that are capable of independent inputs and outputs.
5. An integrated circuit device according to claim 1, wherein the first memory includes a first input memory that stores data that is to be inputted into the data processing section and a first output memory that stores data that has been outputted from the data processing section.
6. An integrated circuit device according to claim 1, further comprising a first arbitrating unit that manages inputs and/or outputs between the first memory and the data processing section.
7. An integrated circuit device according to claim 6, wherein the first arbitrating unit has a function that outputs a stop signal to the data processing section when conditions for an input to or an output from the data processing unit are not satisfied.
8. An integrated circuit device according to claim 7, wherein the data processing section has a function that stops, according to the stop signal, processing of the at least one data flow that is formed in the data processing section.
9. An integrated circuit device according to claim 6, wherein the first memory includes a first input memory that stores data that is to be inputted into the data processing section and a first output memory that stores data that has been outputted from the data processing section, and the first arbitrating unit includes a first input arbitrating unit that manages data transfers from the first input memory to the data processing section and a first output arbitrating unit that manages data transfers from the data processing section to the first output memory.
10. An integrated circuit device according to claim 6, wherein the first memory includes a plurality of storing sections that are capable of independent inputs and outputs, and the first arbitrating unit has a function that manages the plurality of storing sections independently.
11. An integrated circuit device according to claim 6, wherein the first memory includes a plurality of storing sections that are capable of independent inputs and outputs, and the first arbitrating unit has a function that manages the plurality of storing sections relationally.
12. An integrated circuit device according to claim 1, wherein in the data processing section, a plurality of data flows are able to be configured, the integrated circuit device comprises a plurality of first memories, and the first address outputting section and the second address outputting section are configured in the processing unit respectively corresponding to each of the plurality of first memories.
13. An integrated circuit device according to claim 12, further comprising a second arbitrating unit that manages inputs and outputs between the second memory and the plurality of first memories, wherein the second address is supplied to the second arbitrating unit.
14. An integrated circuit device according to claim 1, wherein the processing unit includes a plurality of logic elements of a same type whose functions are changeable and a set of wires that connect the logic elements.
15. An integrated circuit device according to claim 1, wherein the processing unit includes a plurality of types of special-purpose processing elements, each type of the plurality of types of special-purpose processing element including an internal data path suited to different special-purpose processing, and a set of wires that connect the special-purpose processing elements.
16. An integrated circuit device according to claim 15, wherein the processing unit includes a type of special-purpose processing element with an internal data path suited to outputting addresses.
17. An integrated circuit device according to claim 15, wherein the special-purpose processing elements include means for selecting part of the internal data path and a configuration memory that stores a selection in the internal data path.
18. An integrated circuit device according to claim 17, further comprising a control unit that rewrites a content of the configuration memory.
19. An integrated circuit device according to claim 1, further comprising a control unit that indicates a change to at least part of the at least one data flow of the processing unit.
20. An integrated circuit device according to claim 19, wherein the control unit is capable of indicating changes to the at least one data flow of the data processing section, the first address outputting section, or the second address outputting section independently.
21. An integrated circuit device according to claim 19, further comprising a code memory that stores program code that controls the control unit.
22. An integrated circuit device according to claim 1, further comprising: the second memory that is capable of inputting data into and/or outputting data out of a third memory; and a third address outputting means for outputting a third address of data that is inputted and/or outputted between the third memory and the second memory.
23. An integrated circuit device, comprising: a first memory for inputting data into and/or outputting data from a second memory; a processing unit in which at least one data flow, which processes data that is inputted into or outputted from the first memory, is configured; and a first arbitrating unit that manages inputs and/or outputs between the first memory and the processing unit, wherein the first arbitrating unit has a function for outputting a stop signal to the data processing section when conditions for an input to or an output from the data processing section are not satisfied, and the processing unit has a function that stops processing of the at least one data flow according to the stop signal.
24. An integrated circuit device according to claim 23, wherein at least part of the at least one data flow can be changed in the processing unit.
25. An integrated circuit device according to claim 23, wherein the first memory includes a first input memory that stores data that is to be inputted into the processing unit and a first output memory that stores data that has been outputted from the processing unit, and the first arbitrating unit includes a first input arbitrating unit that manages data transfers from the first input memory to the processing unit and a first output arbitrating unit that manages data transfers from the processing unit to the first output memory.
26. An integrated circuit device according to claim 23, wherein the first memory includes a plurality of storing sections that are capable of independent inputs and outputs, and the first arbitrating unit has a function that manages the plurality of storing sections independently.
27. An integrated circuit device according to claim 23, wherein the first memory includes a plurality of storing sections that are capable of independent inputs and outputs, and the first arbitrating unit has a function that manages the plurality of storing sections relationally.
28. A processing unit in which at least one data flow is formed and in which at least part of the at least one data flow is changeable, the processing unit comprising: a data processing section that processes data that is inputted from and/or outputted to a first memory that is capable of inputting data into and/or outputting data from a second memory; a first address outputting section that outputs a first address of data that is inputted and/or outputted between the first memory and the data processing section; and a second address outputting section that outputs a second address of data that is inputted and/or outputted between the first memory and the second memory.
29. A processing unit according to claim 28, wherein the second address outputting section is capable of operating independently of the data processing section and/or the first address outputting section.
30. A processing unit according to claim 28, wherein in the data processing section, a plurality of data flows are able to be configured, and the processing unit comprises pairs of first and second address outputting sections that respectively correspond to each of a plurality of first memories.
31. A processing unit according to claim 28, further comprising a plurality of types of special-purpose processing elements, each of the plurality of types of special-purpose processing elements including an internal data path suited to different special-purpose processing, and a set of wires that connect the special-purpose processing elements.
32. A processing unit according to claim 31, further comprising a type of special-purpose processing elements that include an internal data path that is suited to outputting addresses.
33. A processing device, comprising the processing unit according to claim 31 and the first memory.
34. A processing device according to claim 33, further comprising a control unit that indicates a change to at least part of the at least one data flow in the processing unit.
35. A control method for an integrated circuit device that includes a first memory that is capable of inputting data into and/or outputting data from a second memory and a processing unit in which at least one data flow is formed and in which at least part of the at least one data flow is changeable, the control method comprising a step of instructing the processing unit to configure a data processing section that processes data that is inputted from and/or outputted to the first memory, a first address outputting section that outputs a first address of data that is inputted and/or outputted between the first memory and the data processing section, and a second address outputting section that outputs a second address of data that is inputted and/or outputted between the first memory and the second memory.
36. A control method according to claim 35, including in the step of instructing, a step of independently indicating changes to the data flow of the data processing section, the first address outputting section, or the second address outputting section.
37. A control method according to claim 35, including in the step of instructing, instructing to configure the second address outputting section for operating independently of the data processing section and/or the first address outputting section.
38. A control method according to claim 35, wherein a plurality of data flows are configured in the data processing section, and including in the step of instructing, instructing to form a pair of a first address outputting section and second address outputting section respectively corresponding to each of a plurality of first memories.
39. A control method according to claim 35, further comprising an executing step of forming the at least one data flow in the data processing section and executing processing that is related to data inputted into and/or outputted from the first memory, and including in the executing step, processing of the at least one data flow formed in the data processing section is stopped using a stop signal that is outputted by a first arbitrating unit, which manages inputs and outputs between the first memory and the data processing section, when conditions for inputting or outputting are not satisfied.
40. A control method for an integrated circuit device that includes a first memory that is capable of inputting data into and/or outputting data from a second memory and a processing unit in which at least one data flow, which processes data that is inputted into or outputted from the first memory, is formed, the control method comprising an executing step of executing processing related to data that is inputted into and/or outputted out of the first memory, and including in the executing step, processing of the at least one data flow is stopped according to a stop signal that is outputted by a first arbitrating unit, which manages inputs and outputs between the first memory and the data processing section, when conditions for inputting or outputting are not satisfied.